{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 数据集的预处理，使用dynamic padding构造batch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 试着训练一两条样本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "torch.cuda.is_available()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']\n",
      "- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification\n",
    "\n",
    "# Same as before\n",
    "checkpoint = \"bert-base-uncased\"\n",
    "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
    "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
    "sequences = [\n",
    "    \"I've been waiting for a HuggingFace course my whole life.\",\n",
    "    \"This course is amazing!\",\n",
    "]\n",
    "batch = tokenizer(sequences, padding=True, truncation=True, return_tensors=\"pt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "batch['labels'] = torch.tensor([1, 1])  # tokenizer出来的结果是一个dictionary，所以可以直接加入新的 key-value\n",
    "\n",
    "optimizer = AdamW(model.parameters())\n",
    "loss = model(**batch).loss  #这里的 loss 是直接根据 batch 中提供的 labels 来计算的，回忆：前面章节查看 model 的输出的时候，有loss这一项\n",
    "loss.backward()\n",
    "optimizer.step()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 从Huggingface Hub中加载数据集\n",
    "\n",
    "MRPC (Microsoft Research Paraphrase Corpus) dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Reusing dataset glue (C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "14286509d57343f3bc94a8e2f7bb3c64",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
       "        num_rows: 3668\n",
       "    })\n",
       "    validation: Dataset({\n",
       "        features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
       "        num_rows: 408\n",
       "    })\n",
       "    test: Dataset({\n",
       "        features: ['sentence1', 'sentence2', 'label', 'idx'],\n",
       "        num_rows: 1725\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "raw_datasets = load_dataset(\"glue\", \"mrpc\")\n",
    "raw_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "load_dataset出来的是一个DatasetDict对象，它包含了train，validation，test三个属性。可以通过key来直接查询，得到对应的数据集。\n",
    "\n",
    "这里的train，valid，test都是Dataset类型，有 features和num_rows两个属性。还可以直接通过下标来查询对应的样本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sentence1': 'Amrozi accused his brother , whom he called \" the witness \" , of deliberately distorting his evidence .',\n",
       " 'sentence2': 'Referring to him as only \" the witness \" , Amrozi accused his brother of deliberately distorting his evidence .',\n",
       " 'label': 1,\n",
       " 'idx': 0}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_train_dataset = raw_datasets['train']\n",
    "raw_train_dataset[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dataset的features可以理解为一张表的columns，Dataset甚至可以看做一个pandas的dataframe，二者的使用很类似。\n",
    "\n",
    "我们可以直接像操作dataframe一样，取出某一列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(raw_train_dataset['sentence1'])  # 直接取出所有的sentence1，形成一个list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过Dataset的features属性，可以详细查看数据集特征，包括labels具体都是啥："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sentence1': Value(dtype='string', id=None),\n",
       " 'sentence2': Value(dtype='string', id=None),\n",
       " 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),\n",
       " 'idx': Value(dtype='int32', id=None)}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_train_dataset.features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据集的预处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以直接下面这样处理：\n",
    "```python\n",
    "tokenized_sentences_1 = tokenizer(raw_train_dataset['sentence1'])\n",
    "tokenized_sentences_2 = tokenizer(raw_train_dataset['sentence2'])\n",
    "```\n",
    "但对于MRPC任务，我们不能把两个句子分开输入到模型中，二者应该组成一个pair输进去。\n",
    "\n",
    "tokenizer也可以直接处理sequence pair："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'attention_mask': [1, 1, 1, 1, 1, 1, 1],\n",
      " 'input_ids': [101, 2034, 6251, 102, 2117, 2028, 102],\n",
      " 'token_type_ids': [0, 0, 0, 0, 1, 1, 1]}\n"
     ]
    }
   ],
   "source": [
    "from pprint import pprint as print\n",
    "inputs = tokenizer(\"first sentence\", \"second one\")\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'[CLS] first sentence [SEP] second one [SEP]'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenizer.decode(inputs.input_ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到这里inputs里，还有一个`token_type_ids`属性，它在这里的作用就很明显了，指示哪些词是属于第一个句子，哪些词是属于第二个句子\n",
    "\n",
    "这种神奇的做法，其实也是源于bert-base预训练的任务，即**next sentence prediction**。换成其他模型，比如DistilBert，它在预训练的时候没有这个任务，那它的tokenizer的结果就不会有这个`token_type_ids`属性了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "既然这里的tokenizer可以直接处理pair，我们就可以这么去分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenized_dataset = tokenizer(\n",
    "    raw_datasets[\"train\"][\"sentence1\"],\n",
    "    raw_datasets[\"train\"][\"sentence2\"],\n",
    "    padding=True,\n",
    "    truncation=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但是这样不一定好，因为先是直接把要处理的整个数据集都读进了内存，又返回一个新的dictionary，会占据很多内存。\n",
    "\n",
    "官方推荐的做法是通过`Dataset.map`方法，来调用一个分词方法，实现批量化的分词："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "acc4da4c0f1d4c749535f86832149e6c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/4 [00:00<?, ?ba/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading cached processed dataset at C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\\cache-ef33b0c3c08e7836.arrow\n",
      "Loading cached processed dataset at C:\\Users\\Administrator\\.cache\\huggingface\\datasets\\glue\\mrpc\\1.0.0\\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\\cache-aadac9a568777e3c.arrow\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],\n",
       "        num_rows: 3668\n",
       "    })\n",
       "    validation: Dataset({\n",
       "        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],\n",
       "        num_rows: 408\n",
       "    })\n",
       "    test: Dataset({\n",
       "        features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],\n",
       "        num_rows: 1725\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def tokenize_function(sample):\n",
    "    # 这里可以添加多种操作，不光是tokenize\n",
    "    # 这个函数处理的对象，就是Dataset这种数据类型，通过features中的字段来选择要处理的数据\n",
    "    return tokenizer(sample['sentence1'], sample['sentence2'], truncation=True)\n",
    "\n",
    "tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)\n",
    "tokenized_datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "看看这个map的一些参数：\n",
    "\n",
    "```shell\n",
    "raw_datasets.map(\n",
    "    function,\n",
    "    with_indices: bool = False,\n",
    "    input_columns: Union[str, List[str], NoneType] = None,\n",
    "    batched: bool = False,\n",
    "    batch_size: Union[int, NoneType] = 1000,\n",
    "    remove_columns: Union[str, List[str], NoneType] = None,\n",
    "    keep_in_memory: bool = False,\n",
    "    load_from_cache_file: bool = True,\n",
    "    cache_file_names: Union[Dict[str, Union[str, NoneType]], NoneType] = None,\n",
    "    writer_batch_size: Union[int, NoneType] = 1000,\n",
    "    features: Union[datasets.features.Features, NoneType] = None,\n",
    "    disable_nullable: bool = False,\n",
    "    fn_kwargs: Union[dict, NoneType] = None,\n",
    "    num_proc: Union[int, NoneType] = None,  # 使用此参数，可以使用多进程处理\n",
    "    desc: Union[str, NoneType] = None,\n",
    ") -> 'DatasetDict'\n",
    "Docstring:\n",
    "Apply a function to all the elements in the table (individually or in batches)\n",
    "and update the table (if function does updated examples).\n",
    "The transformation is applied to all the datasets of the dataset dictionary.\n",
    "```\n",
    "\n",
    "关于这个map，在Huggingface的测试题中有讲解，这里搬运并翻译一下，辅助理解：\n",
    "\n",
    "What are the benefits of the Dataset.map method?\n",
    "- The results of the function are cached, so it won't take any time if we re-execute the code.\n",
    "\n",
    "    （通过这个map，对数据集的处理会被缓存，所以重新执行代码，也不会再费时间。）\n",
    "- It can apply multiprocessing to go faster than applying the function on each element of the dataset.\n",
    "\n",
    "    （它可以使用多进程来处理从而提高处理速度。）\n",
    "- It does not load the whole dataset into memory, saving the results as soon as one element is processed.\n",
    "\n",
    "    （它不需要把整个数据集都加载到内存里，同时每个元素一经处理就会马上被保存，因此十分节省内存。）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "观察一下，这里通过map之后，得到的Dataset的features变多了：\n",
    "```python\n",
    "features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids']\n",
    "```\n",
    "多的几个columns就是tokenizer处理后的结果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意到，在这个`tokenize_function`中，我们没有使用`padding`，因为如果使用了padding之后，就会全局统一对一个maxlen进行padding，这样无论在tokenize还是模型的训练上都不够高效。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dynamic Padding 动态padding\n",
    "\n",
    "实际上，我们是故意先不进行padding的，因为我们想**在划分batch的时候再进行padding**，这样可以避免出现很多有一堆padding的序列，从而可以显著节省我们的训练时间。\n",
    "\n",
    "这里，我们就需要用到`DataCollatorWithPadding`，来进行**动态padding**："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import DataCollatorWithPadding\n",
    "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注意，我们需要使用tokenizer来初始化这个`DataCollatorWithPadding`，因为需要tokenizer来告知具体的padding token是啥，以及padding的方式是在左边还是右边（不同的预训练模型，使用的padding token以及方式可能不同）。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面假设我们要搞一个size=5的batch，看看如何使用`DataCollatorWithPadding`来实现："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[50, 59, 47, 67, 59]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "samples = tokenized_datasets['train'][:5]\n",
    "samples.keys()\n",
    "# >>> ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids']\n",
    "samples = {k:v for k,v in samples.items() if k not in [\"idx\", \"sentence1\", \"sentence2\"]}  # 把这里多余的几列去掉\n",
    "samples.keys()\n",
    "# >>> ['attention_mask', 'input_ids', 'label', 'token_type_ids']\n",
    "\n",
    "# 打印出每个句子的长度：\n",
    "[len(x) for x in samples[\"input_ids\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[67, 67, 67, 67, 67]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "batch = data_collator(samples)  # samples中必须包含 input_ids 字段，因为这就是collator要处理的对象\n",
    "batch.keys()\n",
    "# >>> dict_keys(['attention_mask', 'input_ids', 'token_type_ids', 'labels'])\n",
    "\n",
    "# 再打印长度：\n",
    "[len(x) for x in batch['input_ids']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到，这个`data_collator`就是一个把给定dataset进行padding的工具，其输入跟输出是完全一样的格式。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'attention_mask': torch.Size([5, 67]),\n",
       " 'input_ids': torch.Size([5, 67]),\n",
       " 'token_type_ids': torch.Size([5, 67]),\n",
       " 'labels': torch.Size([5])}"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "{k:v.shape for k,v in batch.items()}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个batch，可以形成一个tensor了！接下来就可以用于训练了！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "对了，这里多提一句，`collator`这个单词实际上在平时使用英语的时候并不常见，但却在编程中见到多次。\n",
    "\n",
    "最开始一直以为是`collector`，意为“收集者”等意思，后来查了查，发现不是的。下面是柯林斯词典中对`collate`这个词的解释：\n",
    "\n",
    "> **collate**: \n",
    ">\n",
    "> When you collate pieces of information, you **gather** them all together and **examine** them. \n",
    "\n",
    "就是归纳并整理的意思。所以在我们这个情景下，就是对这些杂乱无章长短不一的序列数据，进行一个个地分组，然后检查并统一长度。\n",
    "\n",
    "关于DataCollator更多的信息，可以参见文档：\n",
    "https://huggingface.co/transformers/master/main_classes/data_collator.html?highlight=datacollatorwithpadding#data-collator"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
